Version 1: 4 January 2023
# Import libraries
import os
import pandas as pd
import numpy as np
import re
import json
import tweepy
import time
import pytz
from datetime import datetime
from langdetect import detect, lang_detect_exception
from nltk.corpus import stopwords
from nrclex import NRCLex
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from textblob import TextBlob
from wordcloud import WordCloud
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import plotly.graph_objects as go
import plotly.express as px
# Set timezone
os.environ['TZ'] = 'Asia/Singapore'
time.tzset()
In a historic move on 29 November 2022, the Singapore Parliament voted to repeal Section 377A of the Penal Code, a law criminalising gay sex. Introduced under British colonial rule in 1938, the law had been in force for over 80 years. Its repeal was widely celebrated by activists and members of the LGBTQ+ community, who had been calling for it in Singapore since as early as 2007 (Channel News Asia, 2022b). On the other hand, some conservative members of society, such as religious groups, expressed disappointment.
Alongside the repeal, however, Parliament also amended the Constitution to introduce a new Institution of Marriage clause, which officially protects the definition of marriage as between a man and a woman from legal challenge in Singapore. While gay sex is no longer criminal, Singapore society still appears unready to accept marriages that are not between a man and a woman.
Hence, we see that there are both progressive and conservative elements in Singapore society surrounding the issue of 377A, and opinions seem to be polarised. This project aims to understand public sentiment surrounding the recent repeal of 377A and the degree of support for or opposition to it. It primarily uses data from Twitter and Reddit. Twitter is an open social network which individuals use to converse with one another in 'tweets', or short messages. Reddit is a social news aggregation and discussion website. I have chosen Twitter and Reddit as they are two social media platforms which Singaporeans, particularly youths, use regularly to discuss and debate current affairs, and their APIs are easily accessible for scraping data.
The flow of my project and the main packages used will be as follows:
In this project, I use the following data:
1) google_377a.csv: A csv file containing data on Google trends for the keyword '377A'
2) tweets: A Pandas DataFrame containing tweets scraped from Twitter with the keyword '377A'. Scraped using the Python library Tweepy.
3) Reddit data, scraped using the package subreddit-comments-dl from https://github.com/pistocop/subreddit-comments-dl
- comments_singapore_p1: A csv file containing Reddit comments scraped from the subreddit r/Singapore from 21 August 2022 to 1 November 2022. r/Singapore is the main subreddit for news and discussions relating to Singapore, with 565,000 members.
- comments_singapore_p2: A csv file containing Reddit comments scraped from the subreddit r/Singapore from 2 November 2022 to 31 December 2022.
- comments_singaporeraw_p1: A csv file containing Reddit comments scraped from the subreddit r/SingaporeRaw from 21 August 2022 to 1 November 2022. r/SingaporeRaw is a less strictly censored and moderated version of r/Singapore, with 47,000 members.
- comments_singaporeraw_p2: A csv file containing Reddit comments scraped from the subreddit r/SingaporeRaw from 2 November 2022 to 31 December 2022.

Main question: Is the overall public sentiment towards the repeal of 377A positive or negative?
Further questions:
To determine the period over which we should scrape data, I use Google web searches as a proxy to track the interest in the topic 377A over time. Using Google Trends, I downloaded data of the Google web searches performed in Singapore in 2021 for the keyword '377A'.
The data can be found here: https://trends.google.com/trends/explore?geo=SG&q=377A.
I insert vertical lines to mark two significant dates on the graph:
# Import data downloaded from Google Trends
df_google = pd.read_csv('google_377a.csv', skiprows=1)
df_google.columns = ['date', 'interest']
# Clean up data
def remove_inequalities(value):
    '''
    Removes the '<' inequality that Google Trends prefixes to some low values,
    so that the column can be converted to floats for plotting
    '''
    if '<' in value:
        value = float(value[1:]) - 1
    else:
        value = float(value)
    return value

df_google['interest'] = df_google['interest'].apply(remove_inequalities)
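As a quick check of the helper (reproduced here so the snippet runs standalone), Google Trends' '<1' placeholder maps to 0.0 while plain numbers pass through unchanged:

```python
# remove_inequalities as defined above, reproduced for a standalone check
def remove_inequalities(value):
    '''Removes the '<' that Google Trends prefixes to some low values.'''
    if '<' in value:
        value = float(value[1:]) - 1
    else:
        value = float(value)
    return value

print(remove_inequalities('<1'))  # → 0.0
print(remove_inequalities('57'))  # → 57.0
```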
# Create a figure
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_google['date'],
                         y=df_google['interest'],
                         mode='lines'))
fig.update_layout(title='Interest in 377A in Singapore by Google Search',
                  xaxis_title='Date',
                  yaxis_title='Interest',
                  autotypenumbers='convert types')
# Add vertical lines for important dates
dates = ['2022-08-21', '2022-11-28']
for date in dates:
    fig.add_shape(type="line",
                  x0=date, y0=0, x1=date, y1=100,
                  line=dict(color="red", dash='dot'))
fig.add_trace(go.Scatter(
    x=dates,
    y=[105, 105],
    text=['National Day Rally',
          'Parliament Debates'],
    mode="text"))
fig.update_layout(showlegend=False)
fig.show()
We can see a sudden spike in interest in 377A from 21 August. This was when Prime Minister Lee first announced, at the National Day Rally, that the government would repeal 377A.
Hence, I choose to scrape data from 21 August to the end of the year, 31 December.
Note: Due to limitations with Twitter's Developer Account (Elevated Access), I am currently only able to scrape tweets from the last 30 days in the developer sandbox. Hence, I have scraped tweets from the period 1 December 2022 to 31 December 2022. In future versions of this project, I will seek to scrape tweets over the full period 21 August 2022 to 31 December 2022.
# Get Twitter API Credentials
twitter_credentials = pd.read_csv('twitter_credentials.csv')
consumer_key = twitter_credentials['keys'][0]
consumer_secret = twitter_credentials['keys'][1]
access_token = twitter_credentials['keys'][2]
access_token_secret = twitter_credentials['keys'][3]
# Create the Authentication Object
authenticate = tweepy.OAuthHandler(consumer_key, consumer_secret)
# Set the Access Token
authenticate.set_access_token(access_token, access_token_secret)
# Create the API object while passing through authentication information
api = tweepy.API(authenticate, wait_on_rate_limit = True)
# Find out the tweet type
def identify_tweet_type(tweet):
    '''
    Identifies whether a tweet is an original tweet, reply tweet, quote tweet or retweet.

    Parameters
    ----------
    tweet: dict
        JSON representation (tweet._json) of a tweet downloaded from the Twitter API using Tweepy

    Returns
    ----------
    tweet_type: str
        The type of the given tweet
    '''
    if tweet["in_reply_to_status_id"] is not None:  # Check whether the tweet replies to another
        tweet_type = "Reply Tweet"
    elif tweet["is_quote_status"] is True and not tweet["text"].startswith("RT"):  # Quote tweet, but not a retweet of a quote tweet
        tweet_type = "Quote Tweet"
    elif tweet["text"].startswith("RT") and tweet.get("retweeted_status") is not None:  # Retweet
        tweet_type = "Retweet"
    else:
        tweet_type = "Original Tweet"
    return tweet_type
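As a minimal check of the classification logic, here is a condensed copy of the function applied to hand-made dicts that mimic only the fields of the Twitter API JSON the function inspects (the dicts are invented for illustration):

```python
# Condensed copy of identify_tweet_type, checked against minimal hand-made dicts
def identify_tweet_type(tweet):
    if tweet["in_reply_to_status_id"] is not None:
        return "Reply Tweet"
    elif tweet["is_quote_status"] is True and not tweet["text"].startswith("RT"):
        return "Quote Tweet"
    elif tweet["text"].startswith("RT") and tweet.get("retweeted_status") is not None:
        return "Retweet"
    else:
        return "Original Tweet"

reply = {"in_reply_to_status_id": 123, "is_quote_status": False, "text": "@user agreed"}
retweet = {"in_reply_to_status_id": None, "is_quote_status": False,
           "text": "RT @user: 377A repealed", "retweeted_status": {}}
original = {"in_reply_to_status_id": None, "is_quote_status": False, "text": "377A repealed"}

print(identify_tweet_type(reply))     # → Reply Tweet
print(identify_tweet_type(retweet))   # → Retweet
print(identify_tweet_type(original))  # → Original Tweet
```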
# Process the tweets into a list of dicts
processed_tweets = []

def process_tweet(status):
    '''
    Processes a page of tweets scraped from the Twitter API, appending each tweet,
    as a dict of its relevant attributes, to the global list processed_tweets.

    Parameters
    ----------
    status: iterable of Status objects
        Each Status object in Tweepy contains information about a tweet
    '''
    for tweet in status:
        if identify_tweet_type(tweet._json) == 'Retweet':  # Get the full text of the retweet
            if 'extended_tweet' in tweet._json['retweeted_status']:
                full_text = tweet._json['retweeted_status']['extended_tweet']['full_text']
            else:
                full_text = tweet._json['retweeted_status']['text']
        else:
            if 'extended_tweet' in tweet._json:  # Get the full text of the extended tweet
                full_text = tweet._json['extended_tweet']['full_text']
            else:
                full_text = tweet._json['text']
        tweet_dict = {"id": tweet._json["id_str"],  # Only keep the relevant attributes
                      "date": tweet._json["created_at"],
                      "full_text": full_text,
                      "tweet_type": identify_tweet_type(tweet._json),
                      "hyperlink": "https://twitter.com/twitter/status/" + tweet._json["id_str"]}
        processed_tweets.append(tweet_dict)
# Scrape tweets from Twitter using Tweepy
for page in tweepy.Cursor(api.search_30_day, label='twitteranalysis', query="\"377A\"").pages(30):
process_tweet(page)
tweets = pd.DataFrame(processed_tweets) # Save processed tweets into a Pandas DataFrame for easy cleaning and analysis later
# Remove duplicate tweets
tweets = tweets.drop_duplicates(subset=['full_text'])
# Remove tweets in foreign languages
def find_lang(text):
    '''
    Identifies the language of a given text.

    Parameters
    ----------
    text: str
        Original text

    Returns
    ----------
    lang: str
        Language of text detected, or 'UNKNOWN' if detection fails
    '''
    try:
        lang = detect(str(text))
    except lang_detect_exception.LangDetectException:
        lang = 'UNKNOWN'
    return lang
# Filter out tweets that are not in English
tweets['language'] = tweets['full_text'].apply(lambda x: find_lang(x)) # Add column to dataframe identifying language
tweets = tweets[tweets['language']=='en'] # Only keep tweets that are in English
tweets = tweets.drop('language', axis=1) # Remove the language column
# Only keep tweets that mention '377A' (case-insensitive, so '377a' also matches)
tweets = tweets[tweets['full_text'].str.contains('377A', case=False)]
# Reverse the order of the tweets so that they are in chronological order
tweets = tweets.iloc[::-1]
# Reset index
tweets = tweets.reset_index(drop=True)
# Check new df
tweets.head()
| | id | date | full_text | tweet_type | hyperlink |
|---|---|---|---|---|---|
| 0 | 1598126745726910464 | Thu Dec 01 01:28:02 +0000 2022 | @ChannelNewsAsia 377A keep for so many years t... | Reply Tweet | https://twitter.com/twitter/status/15981267457... |
| 1 | 1598129259809472513 | Thu Dec 01 01:38:02 +0000 2022 | Small achievements for this month:\n1. Me bein... | Original Tweet | https://twitter.com/twitter/status/15981292598... |
| 2 | 1598141674374692864 | Thu Dec 01 02:27:21 +0000 2022 | 否极泰来 Piji Tailai: 377A is a beautification of ... | Original Tweet | https://twitter.com/twitter/status/15981416743... |
| 3 | 1598142702566387712 | Thu Dec 01 02:31:27 +0000 2022 | @shuyeonify salty about 377a revoked ig | Reply Tweet | https://twitter.com/twitter/status/15981427025... |
| 4 | 1598153715764318209 | Thu Dec 01 03:15:12 +0000 2022 | Of cos it is not. But for control freak, socie... | Original Tweet | https://twitter.com/twitter/status/15981537157... |
# Convert timezones from UTC to local time
def utc_to_local(date):
    '''
    Converts time from the UTC timezone to the local timezone.

    Parameters
    ----------
    date: str
        Date in the UTC timezone

    Returns
    ----------
    date: datetime object
        Date in the local timezone
    '''
    date = datetime.strptime(date, '%a %b %d %H:%M:%S %z %Y')
    local_tz = datetime.now().astimezone().tzinfo
    date = date.replace(tzinfo=pytz.utc).astimezone(local_tz)
    return date
tweets['date'] = tweets['date'].apply(lambda x: utc_to_local(x))
tweets.head()
| | id | date | full_text | tweet_type | hyperlink |
|---|---|---|---|---|---|
| 0 | 1598126745726910464 | 2022-12-01 09:28:02+08:00 | @ChannelNewsAsia 377A keep for so many years t... | Reply Tweet | https://twitter.com/twitter/status/15981267457... |
| 1 | 1598129259809472513 | 2022-12-01 09:38:02+08:00 | Small achievements for this month:\n1. Me bein... | Original Tweet | https://twitter.com/twitter/status/15981292598... |
| 2 | 1598141674374692864 | 2022-12-01 10:27:21+08:00 | 否极泰来 Piji Tailai: 377A is a beautification of ... | Original Tweet | https://twitter.com/twitter/status/15981416743... |
| 3 | 1598142702566387712 | 2022-12-01 10:31:27+08:00 | @shuyeonify salty about 377a revoked ig | Reply Tweet | https://twitter.com/twitter/status/15981427025... |
| 4 | 1598153715764318209 | 2022-12-01 11:15:12+08:00 | Of cos it is not. But for control freak, socie... | Original Tweet | https://twitter.com/twitter/status/15981537157... |
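The conversion can be sanity-checked against a fixed target timezone using the standard library's zoneinfo (Python 3.9+), independently of the machine's local settings; the sample timestamp is the first row above:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

# Parse Twitter's created_at format, then convert UTC to Singapore time
raw = 'Thu Dec 01 01:28:02 +0000 2022'
parsed = datetime.strptime(raw, '%a %b %d %H:%M:%S %z %Y')
local = parsed.astimezone(ZoneInfo('Asia/Singapore'))
print(local.isoformat())  # → 2022-12-01T09:28:02+08:00
```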
# Function to clean the text
def clean_text(text):
    '''
    Cleans the text for analysis by removing mentions, symbols and hyperlinks
    '''
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)  # Removes @mentions
    text = re.sub(r'#', '', text)               # Removes the # symbol
    text = re.sub(r'RT[\s]+', '', text)         # Removes the RT (retweet) marker
    text = re.sub(r'https?:\/\/\S+', '', text)  # Removes hyperlinks
    text = re.sub(r'\n', ' ', text)             # Replaces new lines with spaces
    text = re.sub(r'\\n', ' ', text)            # Replaces escaped new lines with spaces
    text = re.sub(r'>', '', text)               # Removes the '>' quote marker
    return text
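Applied to an invented sample tweet (function reproduced so the snippet runs standalone), the cleaning strips the mention, hashtag symbol, RT marker and link:

```python
import re

def clean_text(text):
    # As defined above
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)
    text = re.sub(r'#', '', text)
    text = re.sub(r'RT[\s]+', '', text)
    text = re.sub(r'https?:\/\/\S+', '', text)
    text = re.sub(r'\n', ' ', text)
    text = re.sub(r'\\n', ' ', text)
    text = re.sub(r'>', '', text)
    return text

sample = 'RT @ChannelNewsAsia: #377A repealed\nDetails: https://example.com/article'
cleaned = clean_text(sample)
print(cleaned)  # mention, '#', 'RT', link and newline are gone
```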
# Clean the tweets
tweets['full_text'] = tweets['full_text'].apply(lambda x: clean_text(x))
tweets.head()
| | id | date | full_text | tweet_type | hyperlink |
|---|---|---|---|---|---|
| 0 | 1598126745726910464 | 2022-12-01 09:28:02+08:00 | 377A keep for so many years to wait for singa... | Reply Tweet | https://twitter.com/twitter/status/15981267457... |
| 1 | 1598129259809472513 | 2022-12-01 09:38:02+08:00 | Small achievements for this month: 1. Me being... | Original Tweet | https://twitter.com/twitter/status/15981292598... |
| 2 | 1598141674374692864 | 2022-12-01 10:27:21+08:00 | 否极泰来 Piji Tailai: 377A is a beautification of ... | Original Tweet | https://twitter.com/twitter/status/15981416743... |
| 3 | 1598142702566387712 | 2022-12-01 10:31:27+08:00 | salty about 377a revoked ig | Reply Tweet | https://twitter.com/twitter/status/15981427025... |
| 4 | 1598153715764318209 | 2022-12-01 11:15:12+08:00 | Of cos it is not. But for control freak, socie... | Original Tweet | https://twitter.com/twitter/status/15981537157... |
We get a DataFrame of 296 tweets on 377A from the month of December 2022.
Using the package subreddit-comments-dl (https://github.com/pistocop/subreddit-comments-dl) and my Reddit API credentials, I scraped comments from two subreddits which discuss current events in Singapore, r/Singapore and r/SingaporeRaw.
# Import scraped Reddit data
singapore_1 = pd.read_csv('comments_singapore_p1.csv')
singapore_2 = pd.read_csv('comments_singapore_p2.csv')
singaporeraw_1 = pd.read_csv('comments_singaporeraw_p1.csv')
singaporeraw_2 = pd.read_csv('comments_singaporeraw_p2.csv')
reddit_comments = pd.concat([singapore_1, singapore_2, singaporeraw_1, singaporeraw_2], axis=0)
# Function for converting Unix time to local time as a datetime object
def unix_to_local(time_recorded):
    '''
    Converts Unix time to local time.

    Parameters
    ----------
    time_recorded : int
        Time recorded in Unix time (number of seconds that have elapsed since 00:00:00 UTC on 1 January 1970)

    Returns
    ----------
    time_recorded_datetime : datetime object
        Time recorded in local time in the format YYYY-MM-DD HH:MM:SS
    '''
    time_recorded_datetime = datetime.fromtimestamp(time_recorded)
    return time_recorded_datetime
# Clean Reddit data
def clean_reddit(reddit_df):
    '''
    Cleans raw data downloaded from Reddit using `subreddit-comments-dl` for analysis
    '''
    reddit_df = reddit_df[reddit_df['body'].str.contains('377A', regex=False, case=False, na=False)]  # Only keep comments mentioning '377A' or '377a'
    reddit_df = reddit_df.drop_duplicates(subset=['body'])  # Drop any duplicates
    reddit_df = reddit_df.drop(['submission_id', 'parent_id'], axis=1)  # Drop columns that are not required
    reddit_df = reddit_df.rename(columns={'body': 'full_text', 'created_utc': 'date'})
    reddit_df = reddit_df[['id', 'date', 'full_text', 'subreddit', 'permalink']]
    reddit_df['full_text'] = reddit_df['full_text'].apply(clean_text)  # Clean up text
    reddit_df['date'] = reddit_df['date'].apply(unix_to_local)  # Convert Unix time to a local datetime object
    reddit_df = reddit_df.sort_values(by=['date']).reset_index(drop=True)
    return reddit_df
reddit_comments = clean_reddit(reddit_comments)
reddit_comments.head()
| | id | date | full_text | subreddit | permalink |
|---|---|---|---|---|---|
| 0 | il52jwx | 2022-08-21 09:42:38 | Singapore's religious leaders outline positi... | Singapore | /r/singapore/comments/wtnckb/singapores_religi... |
| 1 | il537nc | 2022-08-21 09:48:01 | Basically 377a repeal is coming and that relig... | Singapore | /r/singapore/comments/wtnckb/singapores_religi... |
| 2 | il53f7j | 2022-08-21 09:49:44 | Responding to questions about the matter, Arc... | Singapore | /r/singapore/comments/wtnckb/singapores_religi... |
| 3 | il546hb | 2022-08-21 09:55:53 | One of the leader of a Buddhist Group (Buddhis... | Singapore | /r/singapore/comments/wtnckb/singapores_religi... |
| 4 | il54htu | 2022-08-21 09:58:25 | Archbishop Goh, in the interview with The Stra... | Singapore | /r/singapore/comments/wtnckb/singapores_religi... |
We get a DataFrame of 497 Reddit comments on 377A.
After cleaning up the data, I perform sentiment analysis to generate quantitative measures of the sentiment, which will allow me to answer my research questions. I use three NLP packages with different functions for this: TextBlob, NRCLex and nltk.sentiment.vader.
Firstly, I use TextBlob to detect subjectivity and polarity.
Subjectivity quantitatively measures the degree of personal feeling versus factual information in the text (Liu, 2012). The subjectivity value falls within the range [0, 1]; the higher the subjectivity, the more personal bias the text contains. Polarity quantitatively measures the level of positivity or negativity in the text. The polarity value falls within the range [-1, 1], with -1 indicating a negative sentiment and 1 indicating a positive sentiment (ibid). In our case, these measures provide a general overview of whether people feel positively or negatively about the repeal of 377A and how much they care about the issue.
# Function for detecting subjectivity
def get_subjectivity(text):
    return TextBlob(text).sentiment.subjectivity

# Function for detecting polarity
def get_polarity(text):
    return TextBlob(text).sentiment.polarity
Secondly, I use nltk.sentiment.vader to quantitatively measure the overall positivity or negativity of the sentiments. VADER, which stands for Valence Aware Dictionary and sEntiment Reasoner, is a rule-based model by Hutto and Gilbert (2014). Although TextBlob offers a measure of positivity and negativity through its polarity score, I also use VADER because it has been shown to provide highly accurate results, even outperforming human raters (ibid). VADER is also able to detect neutral text. Using VADER, we can more accurately discern the level of positivity or negativity in online sentiments towards the repeal of 377A.
# Function for VADER analysis
def vader_analysis(text):
    sid = SentimentIntensityAnalyzer()
    sentiments = text.apply(sid.polarity_scores)
    return sentiments
Thirdly, I use NRCLex to detect emotions. The NRCLex package quantitatively measures the emotional affects of a given text (Bailey, 2019). It draws on the National Research Council Canada (NRC) affect lexicon and the NLTK library's WordNet synonym sets (ibid). The affects it measures are fear, anger, anticipation, trust, surprise, positive, negative, sadness, disgust and joy. This helps identify the specific emotions that people feel towards the repeal of 377A, rather than just 'positive' or 'negative'.
# Function for detecting emotions
def get_emotions(text):
    text = NRCLex(text)
    return text.affect_frequencies
# Function for creating a DataFrame with sentiment analysis
def create_sentiment_df(text_df):
    '''
    Generates a DataFrame of sentiment analysis data for text, including emotion analysis,
    subjectivity, polarity and VADER analysis.

    Parameters
    ----------
    text_df:
        Pandas DataFrame with data of text to be analysed

    Returns
    ----------
    sentiment_df:
        Pandas DataFrame with text and corresponding sentiment analysis
    '''
    # Adding emotion analysis from NRCLex
    emotion_list = []
    for text in text_df['full_text']:
        emotion_list.append(get_emotions(text))  # We get a list of dicts
    sentiment_df = pd.DataFrame.from_dict(emotion_list)
    sentiment_df.insert(0, 'full_text', '')  # Insert column with original full text
    sentiment_df['full_text'] = text_df['full_text']
    sentiment_df.insert(0, 'date', '')  # Insert column with original date
    sentiment_df['date'] = text_df['date']
    # NRCLex sometimes creates both an 'anticipation' and an 'anticip' column.
    # We drop the 'anticip' column as all its values are 0.0
    sentiment_df = sentiment_df.drop(['anticip'], axis=1)
    sentiment_df.replace(to_replace=np.nan, value=float(0), inplace=True)
    # Adding subjectivity
    sentiment_df['subjectivity'] = sentiment_df.apply(lambda x: get_subjectivity(x['full_text']), axis=1)
    # Adding polarity
    sentiment_df['polarity'] = sentiment_df.apply(lambda x: get_polarity(x['full_text']), axis=1)
    # Adding VADER analysis
    sentiments = vader_analysis(text_df['full_text'])
    scores = ['neg', 'neu', 'pos', 'compound']
    for score in scores:
        sentiment_df['vader_' + score] = [sentiment[score] for sentiment in sentiments]
    # Reordering columns
    sentiment_df = sentiment_df[['full_text', 'date',
                                 'subjectivity', 'polarity',
                                 'fear', 'anger', 'sadness', 'disgust', 'negative',
                                 'trust', 'surprise', 'joy', 'anticipation', 'positive',
                                 'vader_neg', 'vader_neu', 'vader_pos', 'vader_compound']]
    return sentiment_df
# Generating sentiment analysis results for reddit comments
reddit_sentiments = create_sentiment_df(reddit_comments)
reddit_sentiments.head()
| | full_text | date | subjectivity | polarity | fear | anger | sadness | disgust | negative | trust | surprise | joy | anticipation | positive | vader_neg | vader_neu | vader_pos | vader_compound |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Singapore's religious leaders outline positi... | 2022-08-21 09:42:38 | 0.327857 | 0.100679 | 0.043478 | 0.017391 | 0.034783 | 0.008696 | 0.060870 | 0.313043 | 0.017391 | 0.104348 | 0.165217 | 0.234783 | 0.020 | 0.884 | 0.096 | 0.9900 |
| 1 | Basically 377a repeal is coming and that relig... | 2022-08-21 09:48:01 | 0.471818 | 0.053636 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.125000 | 0.250000 | 0.000000 | 0.125000 | 0.250000 | 0.250000 | 0.000 | 0.942 | 0.058 | 0.4404 |
| 2 | Responding to questions about the matter, Arc... | 2022-08-21 09:49:44 | 0.395089 | 0.197768 | 0.061538 | 0.015385 | 0.030769 | 0.000000 | 0.076923 | 0.292308 | 0.030769 | 0.092308 | 0.169231 | 0.230769 | 0.019 | 0.863 | 0.118 | 0.9774 |
| 3 | One of the leader of a Buddhist Group (Buddhis... | 2022-08-21 09:55:53 | 0.133333 | 0.000000 | 0.000000 | 0.166667 | 0.000000 | 0.000000 | 0.000000 | 0.333333 | 0.000000 | 0.000000 | 0.000000 | 0.500000 | 0.051 | 0.899 | 0.051 | 0.0000 |
| 4 | Archbishop Goh, in the interview with The Stra... | 2022-08-21 09:58:25 | 0.497034 | -0.170802 | 0.054795 | 0.041096 | 0.041096 | 0.000000 | 0.164384 | 0.150685 | 0.041096 | 0.095890 | 0.150685 | 0.260274 | 0.109 | 0.845 | 0.046 | -0.9624 |
# Generating sentiment analysis results for tweets
twitter_sentiments = create_sentiment_df(tweets)
twitter_sentiments.head()
| | full_text | date | subjectivity | polarity | fear | anger | sadness | disgust | negative | trust | surprise | joy | anticipation | positive | vader_neg | vader_neu | vader_pos | vader_compound |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 377A keep for so many years to wait for singa... | 2022-12-01 09:28:02+08:00 | 0.500000 | 0.500000 | 0.142857 | 0.000000 | 0.000000 | 0.000000 | 0.142857 | 0.142857 | 0.0 | 0.142857 | 0.285714 | 0.142857 | 0.000 | 0.856 | 0.144 | 0.4939 |
| 1 | Small achievements for this month: 1. Me being... | 2022-12-01 09:38:02+08:00 | 0.616667 | -0.116667 | 0.142857 | 0.142857 | 0.000000 | 0.142857 | 0.142857 | 0.285714 | 0.0 | 0.000000 | 0.000000 | 0.142857 | 0.000 | 0.886 | 0.114 | 0.7003 |
| 2 | 否极泰来 Piji Tailai: 377A is a beautification of ... | 2022-12-01 10:27:21+08:00 | 0.100000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.333333 | 0.0 | 0.333333 | 0.000000 | 0.333333 | 0.000 | 0.734 | 0.266 | 0.4404 |
| 3 | salty about 377a revoked ig | 2022-12-01 10:31:27+08:00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000 | 1.000 | 0.000 | 0.0000 |
| 4 | Of cos it is not. But for control freak, socie... | 2022-12-01 11:15:12+08:00 | 0.470833 | 0.029167 | 0.000000 | 0.000000 | 0.333333 | 0.333333 | 0.333333 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.093 | 0.859 | 0.048 | -0.4606 |
# Finding keywords
stop_words = set(stopwords.words('english'))

def filter_words(sentiments):
    '''
    Joins the text of a sentiment DataFrame into one string of keywords,
    dropping English stopwords and non-alphabetic tokens
    '''
    all_words = ' '.join(sentiments['full_text']).split(' ')
    all_words = [word for word in all_words if word not in stop_words]
    all_words = [word for word in all_words if word.isalpha()]
    filtered_words = ' '.join(all_words)
    return filtered_words
twitter_words = filter_words(twitter_sentiments)
reddit_words = filter_words(reddit_sentiments)
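The filtering pipeline can be traced on a toy example with a hand-made stopword set standing in for NLTK's English list (the helper name and data are mine):

```python
# Hand-made stand-in for stopwords.words('english')
stop_words = {'the', 'of', 'was', 'in'}

def filter_words_demo(texts):
    # Same pipeline as filter_words above: join, split, drop stopwords,
    # then keep only purely alphabetic tokens (this also drops '377A')
    all_words = ' '.join(texts).split(' ')
    all_words = [w for w in all_words if w not in stop_words]
    all_words = [w for w in all_words if w.isalpha()]
    return ' '.join(all_words)

print(filter_words_demo(['the repeal of 377A was announced', 'debate in parliament']))
# → 'repeal announced debate parliament'
```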
For both the Twitter and Reddit data, I plot a graph of polarity (x-axis) against subjectivity (y-axis).
# Plotting polarity and subjectivity
def polarity_subjectivity_plot(sentiments, color, title):
    '''
    Plots a scatter graph of polarity (x-axis) against subjectivity (y-axis).

    Parameters
    ----------
    sentiments: Pandas DataFrame with 'polarity' and 'subjectivity' columns
    color: str, colour of the scatter points
    title: str, title of the plot
    '''
    plt.figure(figsize=(8, 6))
    plt.scatter(sentiments['polarity'], sentiments['subjectivity'], color=color)
    plt.title(title)
    plt.xlabel('Polarity')
    plt.ylabel('Subjectivity')
    return plt.show()
polarity_subjectivity_plot(twitter_sentiments, 'blue', 'Twitter Sentiments')
polarity_subjectivity_plot(reddit_sentiments, 'red', 'Reddit Sentiments')
While the Reddit graph is denser, as expected since the Reddit dataset (497 comments) is larger than the Twitter dataset (296 tweets), the graphs are similar in shape. We observe a roughly V-shaped distribution, which is consistent with the literature on sentiment analysis: the more subjective a text, the more personal opinion it contains, and thus the more likely it is to be strongly positive or strongly negative in polarity.
From both graphs, we also see that more dots lie on the right of the vertical line in the centre, where polarity = 0. This indicates that sentiments on the repeal of 377A are more positive than negative. There does not seem to be a significant difference in terms of overall sentiment on Twitter versus Reddit.
In the graph for Reddit comments, there are a few dots which lie exactly on the vertical line in the centre with polarity = 0 and subjectivity within [0, 0.2]. This indicates that these comments are quite neutral and factual. These comments could be excerpts from news articles, such as Reddit bots that automatically scrape and post news articles.
Using the vader_compound value of each text, I plot a boxplot. The VADER compound score is computed by summing the valence scores of each word in the text and normalising the sum to fall within the range [-1, 1] (Patil et al, 2019). The compound score reveals the polarity of the text: the nearer it is to 1, the more positive the text; the nearer it is to -1, the more negative.
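Concretely, VADER maps the raw sum of word valences s into (-1, 1) with s / sqrt(s^2 + alpha), where alpha = 15 by default; a minimal sketch of this normalisation (the helper name is mine):

```python
import math

def vader_normalize(score, alpha=15):
    # VADER's normalisation: squashes an unbounded valence sum into (-1, 1)
    return score / math.sqrt(score * score + alpha)

print(round(vader_normalize(4), 4))    # moderately positive raw sum → ~0.72
print(round(vader_normalize(-20), 4))  # strongly negative raw sum → close to -1
```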
# Plot boxplot for VADER compound scores
def plot_vader(sentiments_df, color, title):
    fig = go.Figure()
    fig.add_trace(go.Box(x=sentiments_df['vader_compound'],
                         line=dict(color=color),
                         boxmean=True))
    fig.update_layout(title=title,
                      xaxis_title='VADER Compound Score',
                      yaxis_visible=False)
    return fig.show()
plot_vader(twitter_sentiments, 'blue', 'VADER Analysis of 377A Tweets')
plot_vader(reddit_sentiments, 'red', 'VADER Analysis of Reddit Comments')
Unlike TextBlob's analysis, the VADER analysis reveals clear differences between the polarity of sentiments on Twitter versus on Reddit. The mean for the VADER compound score of Tweets is 0.0944 and the median is 0. This indicates that on average, sentiments relating to 377A on Twitter are neutral. The mean for the VADER compound score of Reddit comments is 0.176, while the median is 0.237. Both values are low positive values, indicating that Reddit comments relating to 377A are overall slightly more positive.
Notably, the lower and upper fences of Twitter scores are -0.572 and 0.944 respectively. The same values for Reddit are -0.997 and 0.998, which are very close to -1 and 1 respectively. This indicates that there is a broader spectrum of opinions on Reddit than on Twitter, such as more strongly positive and more strongly negative ones.
Lastly, I plot a scatter figure of the emotional affect values generated by NRCLex for 10 emotions against time, together with a regression line for each emotion to estimate its trend. This is useful for identifying changes in feelings over time, if any, such as around the announcement of the repeal in late August and the announcement that 377A had been successfully repealed in late November. However, this is more useful for the Reddit data, as my Twitter data only spans 1 to 31 December.
# Conducting emotions analysis
def emotions_scatter_fig(sentiments, title):
    scatter_fig = px.scatter(sentiments,
                             x='date',
                             y=sentiments.columns[4:14],
                             trendline='ols',
                             title=title,
                             width=800,
                             height=800)
    return scatter_fig.show()
emotions_scatter_fig(twitter_sentiments, 'Emotions Analysis of Tweets on 377A')
Most of the emotional affects have values close to 0, indicating that they are not significant. I have picked out a few significant emotions to analyse.
Firstly, there was a high level of sadness compared to the other emotions at around the start of December, which was right after 377A was repealed on 29 November. However, this was brief and the level of sadness detected in tweets on 377A declined as the month went on. Notably, we see that there are several 'very sad' tweets right after 4 December with a perfect sadness value of 1. These outliers are responsible for bringing up the average level of 'sadness' detected in the tweets. Most of the tweets are actually below the value of 0.5 for sadness. Hence, apart from a few outliers, the 'sadness' level on average is not that high.

In contrast, the regression line for 'joy' is quite close to the x-axis, with the largest value on the line being y=0.0579. This indicates that overall levels of 'joy' are not very high. At the start of December, the 'joy' level measured in most of the tweets falls within the range [0, 0.3], indicating low levels of joy. This shows that Twitter users were not very joyful about the repeal of 377A.

Finally, the regression line for 'positive' emotional affect sits higher above the x-axis than the line for 'negative' emotional affect, although absolute values measured are still relatively low. This confirms our earlier analysis that Twitter sentiments towards the repeal of 377A are more positive than negative.

emotions_scatter_fig(reddit_sentiments, 'Emotions Analysis of Reddit Comments on 377A')
Most of the data points are clustered in 21 August-5 September, as well as 27-30 November. This means that there was a spike in discussions of 377A during these periods. This is consistent with the periods of high Google search interest in '377A' seen in the Google Trends data earlier, and is due to the Prime Minister's announcement of the repeal on 21 August and the actual repeal on 29 November.
The data also shows points concentrated around 20-23 October, indicating greater discussion of 377A on Reddit during those days. This is because on 20 October, the bill to repeal 377A was tabled in Parliament while the constitutional amendment to the definition of marriage was introduced (Channel News Asia, 2022a). Although this news did not produce a significant spike in Google searches for 377A, it did spark discussion on Reddit.
Turning our attention to the various emotional affects, we see that the regression lines have gentle gradients and are almost horizontal. This means that the degree of each emotional affect is quite consistent across the period, with little fluctuation. Some affects, such as 'surprise' and 'disgust', are very close to the x-axis, showing that there is little 'surprise' and 'disgust' detected in Reddit comments on 377A.
The emotional affect with the second-highest values, based on its linear regression line, is 'trust'. Although it is not clear in whom the 'trust' is placed, it is plausible that this represents the people's trust in the government on the issue of the repeal of 377A. Similarly, 'trust' also registers relatively high values compared to other emotional affects in the emotions analysis graph for Twitter.

The level of 'fear' appears to be quite low, but it is not insignificant. All measures of 'fear' lie below 0.5, and many fall within the range [0, 0.3], showing that many Reddit commenters express a small degree of fear regarding 377A. Notably, unlike the other emotional affects, 'fear' has no outliers with a value of 1.

Based on the linear regression lines, the affect with the highest values is 'positive', while 'negative' has the third-highest values. As with the Twitter emotional affects graph, the positive values appear higher than the negative values on the whole. This corroborates our earlier analysis that the repeal of 377A was received with more positive than negative sentiment.

Finally, using the package WordCloud, we can plot a word cloud from the keywords generated earlier to explore issues or commonly held opinions surrounding the repeal of 377A. A word cloud provides a visual representation of the words used in the text, with more frequent words rendered larger. It also enables us to compare the similarities and differences between the debate around the repeal of 377A on Twitter versus Reddit.
# Plotting a word cloud with keywords
def plot_word_cloud(all_words):
    '''
    Plots a word cloud given a string of keywords
    '''
    word_cloud = WordCloud(width = 700, height = 500, random_state = 21, max_font_size = 120, collocations = False).generate(all_words)
    plt.imshow(word_cloud, interpolation = 'bilinear')
    plt.axis('off')
    plt.show()
plot_word_cloud(twitter_words)
plot_word_cloud(reddit_words)
Observing both word clouds, we see that some keywords are common, such as 'Singapore', 'repeal', 'gay' and 'LGBT'. These are expected as they are keywords relating to the 377A law.
In the Twitter word cloud, the words 'turn', 'page', 'dark' and 'history' are very prominent. This is likely because many Twitter users were tweeting, quote-tweeting or retweeting the BBC article, '377A repeal: Singapore turns page on dark LGBT history' (link: https://www.bbc.co.uk/news/world-asia-63832825). It is interesting that an article from a British news source on a law repeal in Singapore was the most-mentioned piece. This could suggest that many of the Twitter users tweeting about 377A are international users rather than Singaporean citizens, who tend to read news on Singapore from local news outlets such as The Straits Times or CNA.
In the Reddit word cloud, we see that the words 'marriage' and 'law' are very large. This suggests that Reddit users were commenting on the government's constitutional amendment to the definition of marriage in late November, although their opinion on the issue is not clear from the word cloud. In contrast, these words are less prominent in the Twitter word cloud.
Based on the data from all three packages, we see that the overall public sentiment towards the repeal of 377A is more positive than negative. However, this positive opinion is not strong.
Upon further research, this finding is consistent with surveys conducted in Singapore. In a survey at the end of August, Blackbox Research found that 43% of Singaporean adults supported the repeal, while 21% opposed it (The Straits Times, 2022), demonstrating that more held positive sentiments towards the repeal than negative ones. 34% remained neutral and 2% did not state their stand.
Although the constitutional amendment was discussed, particularly on Reddit, the overall sentiment towards the repeal of 377A appears to still be positive.
Unlike mainstream media narratives of conservative religious groups being at loggerheads with liberal LGBTQ+ activists, opinion within society on the issue of 377A is not very polarised. Most people express slightly positive or roughly neutral sentiments, rather than strong positive or negative sentiments.
Reading the data from the emotional affects graphs, apart from the 'positive' and 'negative' affects, the emotional affects 'trust' and 'anticipation' rank highly compared to other emotions. This could suggest that the people trust the government and anticipated the repeal of 377A or more progressive change in society. However, these values are not high, indicating weak rather than strong emotions.
Observing the linear regression lines in the Reddit emotional analysis graph, we see that they have gentle gradients. This shows that there has actually been little change in emotions or sentiments over time, from the announcement of the government's intention to repeal to the actual repealing.
Firstly, the Twitter dataset is not satisfactory as it only covers the period 1-31 December. Ideally, we should scrape Twitter data from 21 August to 31 December. This would also provide a better basis for comparison against the Reddit data and allow us to track changes in tweet sentiment across time. Unfortunately, Academic Access to the Twitter API is required for conducting a full-archive search with Tweepy, which I am currently unable to access.
Secondly, it is hard to confirm whether commentators are Singaporean or not. For Reddit, it is likely that most users who post on r/Singapore are Singaporeans, as it is a niche community of primarily local interest. However, it is hard to confirm whether Twitter users posting about '377A' are Singaporeans, as they are not tweeting within a specific community. Given the longstanding nature of 377A and the conservative nature of Asian society, the Singapore government's decision to repeal it attracted the attention of, and was lauded by, many international onlookers, particularly in other Asian countries. The fact that the BBC article, rather than an article from a local news outlet, was often mentioned in tweets could suggest that these Twitter users were not from Singapore. This limits the effectiveness of this project in analysing Singaporeans' sentiments towards the repeal.
Thirdly, to gain a more complete picture of public sentiment, we could scrape data from other social media platforms. Twitter and Reddit may not be representative of the sentiments of the population, particularly middle-aged adults who may prefer Facebook, or younger teenagers who prefer Instagram and TikTok. However, it would similarly be difficult to confirm whether commentators on those platforms are Singaporean.
Firstly, the project was unable to capture sentiments expressed in other languages. As Singapore is a multi-ethnic society with four main ethnic groups, it is likely that some opinions about Singapore on social media are expressed in other languages, such as Malay or Mandarin. Looking briefly through the Twitter and Reddit data, we see phrases such as '彩虹沙河' in Mandarin, which translates to 'Rainbow Sand River', likely indicating the user's support for LGBTQ+ rights. Such sentiments are not captured by this project.
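One lightweight way to flag such posts for separate handling would be to check for CJK characters. The snippet below is a simple heuristic sketch of my own, not the langdetect pipeline imported at the top of this notebook, and it would not catch non-English text in Latin script such as Malay:

```python
import re

# Matches common CJK ideographs; a rough heuristic, not full language detection
CJK_PATTERN = re.compile(r'[\u4e00-\u9fff]')

def contains_cjk(text):
    '''Returns True if the text contains any CJK ideographs.'''
    return bool(CJK_PATTERN.search(text))

contains_cjk('彩虹沙河')        # True: flag for non-English handling
contains_cjk('repeal of 377A')  # False
```

Posts flagged this way could be routed to a Mandarin-capable sentiment tool instead of being dropped.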
Secondly, for more accurate analysis of keywords, or if further keyword analysis is required, stemming and lemmatisation can be performed using NLTK. I did not stem or lemmatise the keywords in this project as my focus was not on keyword analysis and I did not want to run the risk of incorrectly stemming or lemmatising a keyword.
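As a sketch of what that would look like (assuming NLTK is available, as imported at the top of this notebook), the Porter stemmer reduces inflected forms to a common stem before counting keywords:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# 'repeal', 'repeals' and 'repealing' all reduce to the same stem,
# so they would be counted as one keyword rather than three
stems = [stemmer.stem(word) for word in ['repeal', 'repeals', 'repealing']]

# Stemming can also produce non-word stems, which illustrates the risk
# of mangling keywords mentioned above
stemmer.stem('marriages')
```

Lemmatisation with NLTK's WordNetLemmatizer (which requires downloading the WordNet corpus) returns dictionary words instead of raw stems, at the cost of needing part-of-speech information for best results.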
Bailey, M. M. (Ed.). (2019). NRCLex: An affect generator based on TextBlob and the NRC affect lexicon. PyPI. https://pypi.org/project/NRCLex/
Channel News Asia. (2022a, October 20). Bills to repeal 377A, amend Constitution to protect definition of marriage tabled in Parliament. CNA. https://www.channelnewsasia.com/singapore/repeal-377a-constitution-amendments-protect-marriage-parliament-bill-first-reading-3016736?cid=FBcna
Channel News Asia. (2022b, November 28). Timeline: Repealing Section 377A and amending the Constitution to protect the definition of marriage. CNA. https://www.channelnewsasia.com/singapore/timeline-repealing-377a-gay-sex-law-amending-singapore-constitution-marriage-definition-3101551
Hutto, C., & Gilbert, E. (2014). VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text. Proceedings of the International AAAI Conference on Web and Social Media, 8(1), 216–225. https://ojs.aaai.org/index.php/ICWSM/article/view/14550/14399
Liu, B. (2012). Sentiment Analysis and Opinion Mining. https://www.cs.uic.edu/~liub/FBS/SentimentAnalysis-and-OpinionMining.pdf
Patil, A., R, A., Rayar, S., & K M, V. (2019). Comparison of VADER and LSTM for Sentiment Analysis. International Journal of Recent Technology and Engineering (IJRTE), 7(6S).
The Straits Times. (2022, August 25). Section 377A: Religious groups call for unity; poll finds 43% support repeal, double those against. The Straits Times. https://www.straitstimes.com/singapore/politics/section-377a-religious-groups-call-for-unity-poll-finds-43-per-cent-support-repeal-double-those-against